We hereby state that all work provided in this document is our own, and that work referred to from other sources has been cited. We confirm that we have adhered to St. Clair College’s Academic Integrity Policy.
R : R version 4.2.2 (2022-10-31 ucrt)
RStudio 2022.12.0+353 “Elsbeth Geranium” Release (7d165dcfc1b6d300eb247738db2c7076234f6ef0, 2022-12-03) for Windows Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2022.12.0+353 Chrome/102.0.5005.167 Electron/19.1.3 Safari/537.36
library(ggplot2)
library(ggthemes)
library(dplyr)
library(plotly)
The data was collected through the Behavioral Risk Factor Surveillance System (BRFSS), an annual survey conducted by the CDC (USA). It consists of questions covering factors such as age, lifestyle habits, medical conditions, gender, financial situation, and education. The responses help identify the tests and preventive measures needed to reduce heart disease.
ds = read.csv("Heart Disease Health Indicator.csv")
head(ds)
## HeartDiseaseorAttack HighBP HighChol CholCheck BMI Smoker Stroke Diabetes
## 1 0 1 1 1 40 1 0 0
## 2 0 0 0 0 25 1 0 0
## 3 0 1 1 1 28 0 0 0
## 4 0 1 0 1 27 0 0 0
## 5 0 1 1 1 24 0 0 0
## 6 0 1 1 1 25 1 0 0
## PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
## 1 0 0 1 0 1 0
## 2 1 0 0 0 0 1
## 3 0 1 0 0 1 1
## 4 1 1 1 0 1 0
## 5 1 1 1 0 1 0
## 6 1 1 1 0 1 0
## GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 1 5 18 15 1 0 9 4 3
## 2 3 0 0 0 0 7 6 1
## 3 5 30 30 1 0 9 4 8
## 4 2 0 0 0 0 11 3 6
## 5 2 3 0 0 0 11 5 4
## 6 2 0 2 0 1 10 6 8
str(ds)
## 'data.frame': 253680 obs. of 22 variables:
## $ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 1 0 ...
## $ HighBP : num 1 0 1 1 1 1 1 1 1 0 ...
## $ HighChol : num 1 0 1 0 1 1 0 1 1 0 ...
## $ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
## $ Smoker : num 1 1 0 0 0 1 1 1 1 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Diabetes : num 0 0 0 0 0 0 0 0 2 0 ...
## $ PhysActivity : num 0 1 0 1 1 1 0 1 0 0 ...
## $ Fruits : num 0 0 1 1 1 1 0 0 1 0 ...
## $ Veggies : num 1 0 0 1 1 1 0 1 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 1 1 0 0 0 0 0 0 0 ...
## $ GenHlth : num 5 3 5 2 2 2 3 3 5 2 ...
## $ MentHlth : num 18 0 30 0 3 0 0 0 30 0 ...
## $ PhysHlth : num 15 0 30 0 0 2 14 0 30 0 ...
## $ DiffWalk : num 1 0 1 0 0 0 0 1 1 0 ...
## $ Sex : num 0 0 0 0 0 1 0 0 0 1 ...
## $ Age : num 9 7 9 11 11 10 9 11 9 8 ...
## $ Education : num 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : num 3 1 8 6 4 8 7 4 1 3 ...
# Note: is.null() only tests whether ds itself is NULL; missing cells are counted by is.na() below
sum(is.null(ds))
## [1] 0
sum(is.na(ds))
## [1] 0
The data has 253,680 records and 22 variables, with no null or NA values.
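The missing-value check above returns a single total. As a minimal sketch, the same check can be broken down per column and per row. The data frame `toy` below is a hypothetical stand-in with made-up values, not the BRFSS data.

```r
# Toy stand-in for `ds` (hypothetical values, not the BRFSS data)
toy <- data.frame(
  HeartDiseaseorAttack = c(0, 1, 0, NA),
  BMI = c(40, 25, NA, 27),
  Smoker = c(1, 1, 0, 0)
)

# Number of missing cells in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that have no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Applied to `ds`, a per-column count of zero everywhere would confirm the single total reported above.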
The dataset was imported into the data frame ds. No changes were made to this variable; however, changes were made to a few columns in order to better represent the data. The changes include:
The core aspect of the data set is heart disease, so it is of interest to find how heart disease is distributed in the data. A bar plot is used because HeartDiseaseorAttack is a categorical variable.
# Set color for plot
my_col = c("#7eb77f", "#d81e5b")
ggplot(ds, aes(factor(HeartDiseaseorAttack), fill = factor(HeartDiseaseorAttack))) +
geom_bar() +
scale_x_discrete(labels = c("0" = "NO", "1" = "YES")) +
scale_fill_manual(values = my_col, labels = c("0" = "NO", "1" = "YES")) +
labs(
title = "Distribution of Heart Disease or Attack",
x = "Heart Disease or Attack",
y = "Count",
caption = "Source: BRFSS 2015 | Kaggle",
fill = "Heart Disease/Attack"
) +
theme_hc() +
theme(
axis.title = element_text()
)
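A bar plot shows counts, but the share of each class is often easier to quote. The sketch below computes those shares with `table()` and `prop.table()`. It uses a hypothetical ten-row stand-in for `ds`, so the numbers are illustrative only.

```r
# Toy stand-in for ds$HeartDiseaseorAttack (hypothetical values)
toy <- data.frame(HeartDiseaseorAttack = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0))

# Count each class, using the same NO/YES labels as the plot
counts <- table(factor(toy$HeartDiseaseorAttack,
                       levels = c(0, 1),
                       labels = c("NO", "YES")))

# Convert the counts to proportions and print them as percentages
shares <- prop.table(counts)
print(round(100 * shares, 1))
```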
In this plot, the Sex variable is mapped to a ‘bar’ plot to identify how the data is distributed. Since Sex is also a categorical variable, the bar plot is used again to find the distribution.
ggplot(ds, aes(x = factor(Sex), fill= factor(Sex))) +
geom_bar(stat = "count") +
scale_x_discrete(labels = c("0" = "Female", "1" = "Male")) +
scale_y_continuous(breaks = scales::breaks_width(25000)) +
scale_fill_manual(values = my_col, labels = c("0" = "Female", "1" = "Male")) +
labs(x = "Gender",
y = "Count",
title = "Distribution of Gender",
caption = "Source: BRFSS 2015 | Kaggle",
fill = "Gender"
) +
theme_hc() +
theme(
axis.title = element_text()
)
In this plot, the PhysHlth variable is mapped to an ‘area’ plot to identify how the data is distributed. This variable records the answer to the question “Out of the past 30 days, how many were you physically sick?”. The values therefore range from 0 to 30 days.
ggplot(ds, aes(x = PhysHlth)) +
geom_area(stat = "bin", fill = "#8bbf7e") +
labs(x = "Physical Health",
y = "Count",
title = "Distribution of Physical Health",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_hc() +
theme(
axis.title = element_text()
)
The BMI variable is another continuous variable that is noted in the data set. We use a histogram this time with a binwidth of 5 to identify the distribution of the data according to the ranges of the BMI.
ggplot(ds, aes(BMI)) +
geom_histogram(binwidth = 5, fill = "#8bbf7e") +
labs(x = "BMI",
y = "Count",
title = "Distribution of BMI",
caption = "Source: BRFSS 2015 | Kaggle"
) +
theme_hc() +
theme(
axis.title = element_text()
)
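Fixed-width bins are one way to group BMI. Another is to group by named ranges. The sketch below uses `cut()` with the commonly used adult cut-offs of 18.5, 25, and 30. These cut-offs and the class names are an assumption added for illustration and do not come from the dataset.

```r
# Hypothetical BMI values for illustration
bmi <- c(17, 24, 25, 30, 40)

# Assumed cut-offs: <18.5, 18.5 to <25, 25 to <30, >=30
# right = FALSE makes each interval closed on the left, e.g. [25, 30)
bmi_class <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right = FALSE
)

# Count how many values fall in each class
print(table(bmi_class))
```

The resulting factor could be mapped to the x aesthetic of `geom_bar()` in the same way as the categorical variables above.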
To understand the impact that difficulty walking has on mental health, we map the DiffWalk variable and the MentHlth variable to a ‘boxplot’. Since one of the variables is numeric and the other is discrete, we use the box plot to isolate outliers and identify the distribution of the data.
ggplot(ds, aes(factor(DiffWalk), MentHlth, fill = factor(DiffWalk))) +
geom_boxplot() +
scale_x_discrete(labels = c("0" = "NO", "1" = "YES")) +
scale_fill_manual(values = my_col, labels = c("0" = "NO","1" = "YES")) +
labs(x = "Difficulty Walking",
y = "No of Mentally sick days",
title = "Difficulty Walking vs Mental Health",
caption = "Source: BRFSS 2015 | Kaggle",
fill = "Difficulty Walking"
) +
theme_hc() +
theme(
axis.title = element_text()
) +
guides(fill = "none")
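The box plot can be backed by a small numeric summary per group. The following is a sketch using `dplyr` on a hypothetical six-row stand-in for `ds`. The column names match the dataset, but the values are made up.

```r
library(dplyr)

# Toy stand-in for `ds` (hypothetical values, not the BRFSS data)
toy <- data.frame(
  DiffWalk = c(0, 0, 0, 1, 1, 1),
  MentHlth = c(0, 0, 3, 0, 10, 30)
)

# One row per DiffWalk group with size, median, mean, and spread
walk_summary <- toy %>%
  group_by(DiffWalk) %>%
  summarise(
    respondents = n(),
    median_days = median(MentHlth),
    mean_days = mean(MentHlth),
    iqr_days = IQR(MentHlth),
    .groups = "drop"
  )
print(walk_summary)
```

A large gap between the mean and the median in such a table points to a skewed distribution, which the outliers in the box plot can make hard to judge by eye.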
To identify the effect of Age on Heart disease, the Age variable is mapped in a ‘dodged-bar’ plot with the HeartDiseaseorAttack variable mapped to the fill aesthetic. This will help differentiate the distribution of the age ranges according to Heart Disease.
ggplot(ds, aes(x = factor(Age), fill = factor(HeartDiseaseorAttack))) +
geom_bar(stat = "count", position = "dodge") +
scale_x_discrete(labels = c("1" = "Group 1", "2" = "Group 2", "3" = "Group 3", "4" = "Group 4",
"5" = "Group 5", "6" = "Group 6", "7" = "Group 7", "8" = "Group 8",
"9" = "Group 9", "10" = "Group 10", "11" = "Group 11",
"12" = "Group 12", "13" = "Group 13")) +
scale_fill_manual(values = my_col, labels = c("0" = "NO", "1" = "YES")) +
labs(x = "Age",
y = "Count",
title = "Heart Disease Across the Age Groups",
caption = "Source: BRFSS 2015 | Kaggle",
fill = "Heart Disease/Attack"
) +
theme_hc() +
theme(
axis.title = element_text(),
axis.text.x = element_text(angle = 90)
)
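The labels “Group 1” to “Group 13” can be replaced with age ranges. The sketch below assumes the Age column follows the BRFSS 13-level five-year coding (1 = 18-24, 9 = 60-64, 13 = 80 or older). That mapping should be checked against the codebook before use. The names `age_labels` and `AgeGroup` are illustrative.

```r
# Assumed BRFSS five-year age bands: 18-24, 25-29, ..., 75-79, 80+
age_labels <- c(
  "18-24",
  paste(seq(25, 75, by = 5), seq(29, 79, by = 5), sep = "-"),
  "80+"
)

# Toy stand-in for ds$Age (hypothetical values)
toy <- data.frame(Age = c(1, 9, 13, 9))

# Convert the numeric codes to a labelled factor with all 13 levels
toy$AgeGroup <- factor(toy$Age, levels = 1:13, labels = age_labels)
print(toy)
```

Mapping `AgeGroup` to the x aesthetic would then make the manual `labels` vector in `scale_x_discrete()` unnecessary.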
To identify how heart disease is spread across the different income levels surveyed, the Income variable is plotted in a ‘stacked-bar’ plot with its axes flipped, so that the bars run horizontally. The HeartDiseaseorAttack variable is mapped to the fill aesthetic in order to observe the distribution across the different income ranges.
ggplot(ds, aes(fill = factor(HeartDiseaseorAttack), x=factor(Income))) +
geom_bar(position = "stack", width = 0.75) +
scale_x_discrete(labels = c("1" = "Level 1", "2" = "Level 2", "3" = "Level 3", "4" = "Level 4",
"5" = "Level 5", "6" = "Level 6", "7" = "Level 7", "8" = "Level 8")) +
scale_fill_manual(values = my_col, labels = c("0" = "NO","1" = "YES")) +
labs(
x = "Income Ranges",
y = "Count",
title = "Heart Disease Across the Income Ranges",
caption = "Source: BRFSS 2015 | Kaggle",
fill = "Heart Disease/Attack"
) +
theme_hc() +
theme(
axis.title = element_text()
) +
guides(color = "none") +
coord_flip()
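Stacked counts make it hard to compare income levels that have very different numbers of respondents. As a sketch, the share of respondents with heart disease can be computed per level. The data frame `toy` is a hypothetical stand-in for `ds`.

```r
library(dplyr)

# Toy stand-in for `ds` (hypothetical values, not the BRFSS data)
toy <- data.frame(
  Income = c(1, 1, 1, 1, 8, 8, 8, 8, 8, 8),
  HeartDiseaseorAttack = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Share of respondents reporting heart disease within each income level
income_share <- toy %>%
  group_by(Income) %>%
  summarise(
    respondents = n(),
    share_yes = mean(HeartDiseaseorAttack == 1),
    .groups = "drop"
  )
print(income_share)
```

A similar effect can be had directly in the plot by using `position = "fill"` instead of `position = "stack"` in `geom_bar()`, which scales every bar to the same length.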
In this plot the BMI and PhysHlth variables are compared with each other, split by the GenHlth variable from the dataset. We differentiate the individuals with and without a history of heart disease using the ‘size’ and ‘alpha’ aesthetics on HeartDiseaseorAttack. This helps identify the density of the data within the different GenHlth categories. A scatterplot is used because it is a common way to identify relationships between variables.
#Custom color for plot
col_facet = c("#ee6352","#e9d758","#363635","#57a773","#484d6d")
#Creating a modified dataframe for clearer labels
df <- ds %>% mutate(
GenHlth = recode(GenHlth,
"1" = "Category 1",
"2" = "Category 2",
"3" = "Category 3",
"4" = "Category 4",
"5" = "Category 5"))
ggplot(df, aes(BMI, PhysHlth)) +
geom_point(aes(alpha=factor(HeartDiseaseorAttack), size=factor(HeartDiseaseorAttack), color=factor(GenHlth))) +
geom_smooth(se=FALSE, linewidth = 1.5) +
facet_wrap(vars(factor(GenHlth)), scales = "fixed") +
scale_alpha_manual(values = c(0.75,0.25), labels = c("0" = "NO", "1" = "YES")) +
labs(
title = "BMI vs. Number of Sick Days Experienced",
subtitle = "Relation of BMI and No. of sick days, as per General Health",
x = "BMI",
y = "No. Of Sick Days",
caption = "Source: BRFSS 2015 | Kaggle",
color = "General Health Category",
size = "Heart Disease/Attack",
alpha = "Heart Disease/Attack"
) +
theme_hc() +
theme(
axis.title = element_text(),
legend.position = "right"
) +
scale_size_discrete(labels = c("0" = "NO", "1" = "YES")) +
scale_color_manual(values = col_facet)
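With more than 250,000 rows, a scatterplot like this one can be slow to draw and heavily overplotted. One option, sketched here as an assumption rather than as part of the original analysis, is to draw the points from a random sample taken within each GenHlth category. The data frame `toy` and the sample size of 2 are illustrative.

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Toy stand-in for `ds`: 10 rows in each of the 5 GenHlth categories
toy <- data.frame(
  GenHlth = rep(1:5, each = 10),
  BMI = round(runif(50, min = 18, max = 45)),
  PhysHlth = sample(0:30, size = 50, replace = TRUE)
)

# Draw the same number of rows from every GenHlth category
toy_sample <- toy %>%
  group_by(GenHlth) %>%
  slice_sample(n = 2) %>%
  ungroup()

print(table(toy_sample$GenHlth))
```

On the full data a much larger `n` would be appropriate, and the sampled data frame would be passed to `geom_point()` while `geom_smooth()` could still be fitted on all rows.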
In this interactive plot the MentHlth variable is compared across every Education level and split by the HeartDiseaseorAttack variable. The intention is to see how mental health differs between individuals of different education levels, and whether that pattern changes between those with and without a cardiac illness.
A box plot was chosen in order to isolate the outliers and observe the distribution of the data across all the education levels.
#Custom color for plot
comp_color = c("#a50104","#fcba04","#5448c8","#b4adea","#003b36","#7ae7c7")
#Creating a modified dataframe for clearer labels
cdf <- ds %>% mutate(
HeartDiseaseorAttack = recode(HeartDiseaseorAttack,
"0" = "No Heart Disease/Attack",
"1" = "Heart Disease/Attack"),
Education = recode(Education,
"1" = "Level 1", "2" = "Level 2", "3" = "Level 3",
"4" = "Level 4", "5" = "Level 5", "6" = "Level 6"))
plot <- ggplot(cdf) +
geom_boxplot(aes(factor(Education), MentHlth, fill = factor(Education))) +
facet_wrap(~factor(HeartDiseaseorAttack)) +
scale_fill_manual(values = comp_color) +
labs(
title = "Education vs. No. of Stressful Days",
subtitle = "Correlation of Education and Mental Health, between people with/without Heart Conditions",
caption = "Source: BRFSS 2015 | Kaggle",
x = "Education",
y = "No of Stressful days",
fill = "Education Class"
) +
theme_hc() +
theme(axis.title = element_text(),
axis.text.x = element_text(angle = 90))
ggplotly(plot) %>%
layout(yaxis = list(fixedrange = TRUE),
xaxis = list(fixedrange = TRUE)) %>%
config(displayModeBar = FALSE)
The plot is interactive. Hovering over an individual box plot shows details such as the minimum value, lower limit, first quartile, median, third quartile, upper limit, and maximum value. Hovering over an outlier shows its “MentHlth” value. Each label on the legend can be clicked to hide or show the corresponding level, allowing easy comparisons.
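The hover values correspond to the usual box plot statistics. As a rough sketch, base R's `boxplot.stats()` computes a comparable set for a hypothetical vector of MentHlth values. Plotly may use a slightly different quartile method, so its hover values need not match exactly.

```r
# Hypothetical MentHlth values for one education level
ment_days <- c(0, 0, 0, 0, 2, 3, 5, 30)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   points beyond 1.5 times the box length from the box (outliers)
box <- boxplot.stats(ment_days)

print(box$stats)
print(box$out)
```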
- In what ways do you think data visualization is important to understanding a data set?
Visualization is important for summarising a large amount of data into simple, clear, and understandable information. Where the data set provides raw numbers, visualization helps narrow the focus to a few key variables and literally picture the data.
- In what ways do you think data visualization is important to communicating important aspects of a data set?
When communicating the results of an analysis, presenting the findings in a mathematical or statistical manner might not be understandable to everyone. As the saying goes, “A picture is worth a thousand words”: a good visualization is clear with little to no explanation, making it an effective communication tool.
- What role does your integrity as an analyst play when creating a data visualization for communicating results to others?
When creating data visualizations, it is important to consider the business requirement or the intended outcome of the analysis. Care must be taken to avoid bias and to present concrete findings with the data to back them up.
- How many variables do you think you can successfully represent in a visualization? What happens when you exceed this number?
In a visualization, a maximum of 4 variables can be plotted and successfully represented. With more variables, the visualization becomes cluttered and unreadable.